[reconfigurator] Add "expunged and unreferenced" zones to `PlanningInput` #9730

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Open

jgallagher wants to merge 8 commits into main from john/planning-input-pruneable

Contributor

jgallagher commented Jan 27, 2026

This builds on the BlueprintExpungedZoneAccessReason added in #9608, and adds a new set of zone IDs to PlanningInput for "expunged and unreferenced" zones. These are zones that the planner is safe to prune in a future planning iteration, because they are:

Expunged and ready for cleanup
No longer referenced in CRDB, implying any necessary cleanup is done

This PR only adds the new set of zones to PlanningInput. It will be nearly trivial to add a step to the planner to "prune all zones present in the planning input's expunged_and_unreferenced set", but out of an abundance of caution I'd prefer to land this first, collect a reconfigurator state from some deployed racks, and confirm the contents match what we'd expect on some real systems.

This is a little weird in that the logic for whether a zone can be pruned lives in the "prepare a planning input" crate rather than the planner itself, but I think this is reasonable given the implementation of several of the checks (e.g., "query CRDB for this expunged zone specifically" to see whether it's still referenced, which the planner can't do since it doesn't have access to CRDB). Happy to discuss alternatives if this seems wrong upon review.

(We may also want to wait on #7278 before adding the actual prune step?)

jgallagher added 8 commits

January 27, 2026 15:21


          add expunged_and_unreferenced

1ede6a9


          input/builder helper methods

73da04c


          initial collecting "expunged and unreferenced"

4a4f0a0


          flesh out oximeter reference check

97d3b23


          flesh out is_nexus_referenced()

c2eb0e3


          docs on static_check_all_reasons_handled()

5cad9b2


          cleanup and comments


          cargo fmt

c80d7e6

jgallagher requested review from davepacheco and smklein

January 27, 2026 20:35

jgallagher commented

View reviewed changes

nexus/types/src/deployment/planning_input.rs

+                      // We have no way to confirm that this zone is "unreferenced" - that's a
+                      // property of the system at large, mostly CRDB state - but we can
+                      // confirm that it's expunged and ready for cleanup by looking at the
+                      // parent blueprint.

Contributor Author

jgallagher Jan 27, 2026

I was originally going to add a couple blippy lints that any zone in expunged_and_unreferenced is (a) present and (b) actually expunged, but with the guard in this method and the fact that PlanningInput's fields are private, I don't believe it's actually possible to construct a PlanningInput with any such zone IDs in expunged_and_unreferenced, so it was impossible to write a test to confirm the blippy lints triggered.

davepacheco reviewed

View reviewed changes

Collaborator

davepacheco left a comment

Nice! This is really clean. I've commented on one possible bug here and the rest is nitty feedback.

nexus/db-queries/src/db/datastore/saga.rs

+                          .map_err(|e| public_error_from_diesel(e, ErrorHandler::Server))
+                  }
+                  /// many queries as needed (in batches) to get them all

Collaborator

davepacheco Jan 29, 2026

I think this is a stray line

nexus/reconfigurator/preparation/src/expunged_and_unreferenced.rs

+              //! Helpers for identifying when expunged zones are no longer referenced in the
+              //! database.
+              //!
+              //! When a zone is expunged, the expunged zone is still present in the

Collaborator

davepacheco Jan 29, 2026

Suggested change

      
            //! When a zone is expunged, the expunged zone is still present in the
          
            //! When a zone is first expunged, the expunged zone initially remains in the

nexus/reconfigurator/preparation/src/expunged_and_unreferenced.rs

+              //! blueprint. Two high-level conditions must be satisfied before it's safe to
+              //! prune an expunged zone from the blueprint (i.e., delete it entirely):
+              //!
+              //! 1. The sled-agent responsible for running the zone must confirm the zone is

Collaborator

davepacheco Jan 29, 2026

I think item 1 here is an "or" of either (1) what you have here, or (2) the sled is expunged.

nexus/reconfigurator/preparation/src/expunged_and_unreferenced.rs

+              //! 2. Any cleanup work that operates on expunged zones must be complete. This
+              //!    is zone-type-specific. Some zone types have no cleanup work at all and
+              //!    can be pruned as soon as the first condition is satisfied. Others have
+              //!    multiple, disparate cleanup work, all of which must be completed.

Collaborator

davepacheco Jan 29, 2026

Suggested change

      
            //!    multiple, disparate cleanup work, all of which must be completed.
          
            //!    multiple, disparate cleanup steps, all of which must be completed.

nexus/reconfigurator/preparation/src/expunged_and_unreferenced.rs

+              use std::net::IpAddr;
+              use std::num::NonZeroU32;
+              pub(super) async fn find_expunged_and_unreferenced_zones(

Collaborator

davepacheco Jan 29, 2026

Feels important to document this function.

nexus/reconfigurator/preparation/src/expunged_and_unreferenced.rs

+                      }
+                  }
+                  fn in_service_boundary_ntp_upstream_configs(

Collaborator

davepacheco Jan 29, 2026

Having looked at how this is used: why not compute these two sets once in find_expunged_and_unreferenced_zones() and then pass them to the two helpers that use this object?

This isn't bad or anything -- it just took me a minute to figure out what was going on with it and I think in the end it's simpler than it looks.

nexus/reconfigurator/preparation/src/expunged_and_unreferenced.rs

+                  // it's still "referenced" (in that the planner needs to refer to it to set
+                  // up a new boundary NTP zone).
+                  let expunged_config = BoundaryNtpUpstreamConfig::new(boundary_ntp);
+                  bp_refs

Collaborator

davepacheco Jan 29, 2026

This looks backwards to me. This returns whether an in-service zone has the same upstream NTP config as the expunged zone. That's the case where the zone is not referenced any more, right?

(If that's right, there's probably also a test gap here.)

Contributor Author

jgallagher Jan 29, 2026

Yeah, great catch.

(If that's right, there's probably also a test gap here.)

Agreed - I was planning on manually testing this first by grabbing reconfigurator-cli state from a system (once this landed) and then figuring out where those tests might live. I don't think putting them in reconfigurator-cli (or using the underlying simulator) will work, because those don't have access to a live CRDB, and the majority of logic in this module requires querying for not-blueprint bits in the db. Maybe just unit tests here, and know that there will be quite a lot of boilerplate to test all these cases?

nexus/reconfigurator/preparation/src/expunged_and_unreferenced.rs

Comment on lines +184 to +185

		clickhouse_config.keepers.contains_key(&zone_id)
		\|\| clickhouse_config.servers.contains_key(&zone_id)

Collaborator

davepacheco Jan 29, 2026

I don't know enough about this to evaluate this case -- may want to ask @karencfv or @andrewjstone to double-check.

nexus/reconfigurator/preparation/src/expunged_and_unreferenced.rs

Comment on lines +192 to +196

+                  // Check BlueprintExpungedZoneAccessReason::ExternalDnsExternalIps; we
+                  // consider an external DNS zone "still referenced" if its IP is _not_
+                  // assigned to an in-service external DNS zone. (If the IP _is_ assigned to
+                  // an in-service external DNS zone, the expunged zone is no longer
+                  // referenced and can be safely pruned.)

Collaborator

davepacheco Jan 29, 2026

General nit and feel free to ignore: I'd suggest instead of framing all these in terms of "referenced", framing them in terms of "needed", and then changing a comment like this to say "We only need to keep track of an expunged external DNS zone if its external IP has not yet been re-used (because this is currently the only way we keep track of which external IPs the external DNS zones should be using)."

That is: "referenced" feels like a special case of "needed". It applies to Oximeter and multi-node Clickhouse and potentially others. But it seems a little wrong to shoehorn a case like this into the term "referenced".

(Obviously not a big deal)

nexus/reconfigurator/preparation/src/expunged_and_unreferenced.rs

Comment on lines +208 to +211

+                  if !datastore
+                      .database_nexus_access_all(opctx, &BTreeSet::from([zone_id]))
+                      .await?
+                      .is_empty()

Collaborator

davepacheco Jan 29, 2026

Don't we already load this information separately for the planning input?

(Not sure it's worth optimizing if we do, but figured I'd mention it. I don't think there's a correctness issue if we get different results? But worth being sure.)

Contributor Author

jgallagher Jan 29, 2026

Not quite - we only load rows in the DbMetadataNexusState::{Active,NotYet} states. But this check needs to confirm there aren't rows in any state.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet